{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Describing the population of grain sizes\n",
    "\n",
    "The method to describe the properties of the grain size population is named ``summarize()``. Before we get into the details of the method, let's run the GrainSizeTools script, load the example dataset, and create a toy dataset with known parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "module plot imported\nmodule averages imported\nmodule stereology imported\nmodule piezometers imported\nmodule template imported\n\n======================================================================================\nWelcome to GrainSizeTools script\n======================================================================================\nA free open-source cross-platform script to visualize and characterize grain size\npopulation and estimate differential stress via paleopizometers.\n\nVersion: v3.0.2 (2020-12-30)\nDocumentation: https://marcoalopez.github.io/GrainSizeTools/\n\nType get.functions_list() to get a list of the main methods\n\n"
     ]
    }
   ],
   "source": [
    "# Load the script first (change the path to GrainSizeTools_script.py accordingly!)\n",
    "%run C:/Users/marco/Documents/GitHub/GrainSizeTools/grain_size_tools/GrainSizeTools_script.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "         Area  Circ.    Feret  FeretX  FeretY  FeretAngle  MinFeret     AR  \\\n",
       "0  1   157.25  0.680   18.062  1535.0     0.5     131.634    13.500  1.101   \n",
       "1  2  2059.75  0.771   62.097   753.5    16.5     165.069    46.697  1.314   \n",
       "2  3  1961.50  0.842   57.871   727.0    65.0      71.878    46.923  1.139   \n",
       "3  4  5428.50  0.709  114.657  1494.5    83.5      19.620    63.449  1.896   \n",
       "4  5   374.00  0.699   29.262  2328.0    34.0      33.147    16.000  1.515   \n",
       "\n",
       "   Round  Solidity  diameters  \n",
       "0  0.908     0.937  14.149803  \n",
       "1  0.761     0.972  51.210889  \n",
       "2  0.878     0.972  49.974587  \n",
       "3  0.528     0.947  83.137121  \n",
       "4  0.660     0.970  21.821815  "
      ],
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th></th>\n      <th>Area</th>\n      <th>Circ.</th>\n      <th>Feret</th>\n      <th>FeretX</th>\n      <th>FeretY</th>\n      <th>FeretAngle</th>\n      <th>MinFeret</th>\n      <th>AR</th>\n      <th>Round</th>\n      <th>Solidity</th>\n      <th>diameters</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>1</td>\n      <td>157.25</td>\n      <td>0.680</td>\n      <td>18.062</td>\n      <td>1535.0</td>\n      <td>0.5</td>\n      <td>131.634</td>\n      <td>13.500</td>\n      <td>1.101</td>\n      <td>0.908</td>\n      <td>0.937</td>\n      <td>14.149803</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>2</td>\n      <td>2059.75</td>\n      <td>0.771</td>\n      <td>62.097</td>\n      <td>753.5</td>\n      <td>16.5</td>\n      <td>165.069</td>\n      <td>46.697</td>\n      <td>1.314</td>\n      <td>0.761</td>\n      <td>0.972</td>\n      <td>51.210889</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>3</td>\n      <td>1961.50</td>\n      <td>0.842</td>\n      <td>57.871</td>\n      <td>727.0</td>\n      <td>65.0</td>\n      <td>71.878</td>\n      <td>46.923</td>\n      <td>1.139</td>\n      <td>0.878</td>\n      <td>0.972</td>\n      <td>49.974587</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>4</td>\n      <td>5428.50</td>\n      <td>0.709</td>\n      <td>114.657</td>\n      <td>1494.5</td>\n      <td>83.5</td>\n      <td>19.620</td>\n      <td>63.449</td>\n      <td>1.896</td>\n      <td>0.528</td>\n      <td>0.947</td>\n      <td>83.137121</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>5</td>\n      <td>374.00</td>\n      <td>0.699</td>\n      <td>29.262</td>\n      <td>2328.0</td>\n      <td>34.0</td>\n      <td>33.147</td>\n      <td>16.000</td>\n      <td>1.515</td>\n      <td>0.660</td>\n      <td>0.970</td>\n      <td>21.821815</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 2
    }
   ],
   "source": [
    "# Load the example dataset\n",
    "filepath = 'C:/Users/marco/Documents/GitHub/GrainSizeTools/grain_size_tools/DATA/data_set.txt'\n",
    "dataset = pd.read_csv(filepath, sep='\\t')\n",
    "\n",
    "# estimate equivalent circular diameters (ECDs)\n",
    "dataset['diameters'] = 2 * np.sqrt(dataset['Area'] / np.pi)\n",
    "dataset.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "sample size = 500\n"
     ]
    }
   ],
   "source": [
    "# Set the population properties\n",
    "scale = np.log(20)  # set sample geometric mean to 20\n",
    "shape = np.log(1.5)  # set the lognormal shape to 1.5\n",
    "\n",
    "# generate a random lognormal population of size 500\n",
    "np.random.seed(seed=1)  # this is to generate always the same population for reproducibility\n",
    "toy_dataset = np.random.lognormal(mean=scale, sigma=shape, size=500)\n",
    "print('sample size =', len(toy_dataset))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are now ready to check what we can get from the function ``summarize()``. The simplest example of use would be to pass the data containing the diameters. For simplicity's sake, let's do it with the toy dataset first."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      " \n============================================================================\nCENTRAL TENDENCY ESTIMATORS\n============================================================================\nArithmetic mean = 22.13 microns\nConfidence intervals at 95.0 %\nmCox method: 21.35 - 22.98 (-3.5%, +3.8%), length = 1.623\n============================================================================\nGeometric mean = 20.44 microns\nConfidence interval at 95.0 %\nCLT method: 19.73 - 21.17 (-3.5%, +3.6%), length = 1.441\n============================================================================\nMedian = 20.32 microns\nConfidence interval at 95.0 %\nrobust method: 19.33 - 21.42 (-4.9%, +5.4%), length = 2.096\n============================================================================\nMode (KDE-based) = 17.66 microns\nMaximum precision set to 0.1\nKDE bandwidth = 2.78 (silverman rule)\n \n============================================================================\nDISTRIBUTION FEATURES\n============================================================================\nSample size (n) = 500\nStandard deviation = 9.07 (1-sigma)\nInterquartile range (IQR) = 11.44\nLognormal shape (Multiplicative Standard Deviation) = 1.49\n============================================================================\nShapiro-Wilk test warnings:\nData is not normally distributed!\nNormality test: 0.88, 0.00 (test statistic, p-value)\n============================================================================\n"
     ]
    }
   ],
   "source": [
    "summarize(toy_dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, the ```summarize()``` function returns:\n",
    "\n",
    "- Different **central tendency estimators** (\"averages\") including the arithmetic and geometric means, the median, and the KDE-based mode (i.e. frequency peak).\n",
    "- The **confidence intervals** for the different means and the median at 95% of certainty in absolute value and percentage relative to the average (*a.k.a* coefficient of variation). The meaning of these intervals is that, given the observed data, there is a 95% probability (one in 20) that the true value of grain size falls within this credible interval. The function provides the lower and upper bounds of the confidence interval, the error in percentage respect to the average, and the interval length. \n",
    "- The methods used to estimate the confidence intervals for each average (excepting for the mode). The function ```summarize()``` automatically choose the optimal method depending on distribution features (see below)\n",
    "- The sample size and two population dispersion measures: the (Bessel corrected) [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation) and the [interquartile range](https://en.wikipedia.org/wiki/Interquartile_range).\n",
    "- The shape of the lognormal distribution using the multiplicative standard deviation (MSD)\n",
    "- A Shapiro-Wilk test warning indicating when the data deviates from normal and/or lognormal (when p-value < 0.05).\n",
    "\n",
    "In the example above, the Shapiro-Wilk test tells us that the distribution is not normally distributed, which is to be expected since we know that this is a lognormal distribution. Note that the geometric mean and the lognormal shape are very close to the values used to generate the synthetic random dataset, 20 and 1.5 respectively. Now, let's do the same using the dataset that comes from a real rock, for this, we have to pass the column with the diameters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      " \n============================================================================\nCENTRAL TENDENCY ESTIMATORS\n============================================================================\nArithmetic mean = 34.79 microns\nConfidence intervals at 95.0 %\nCLT (ASTM) method: 34.09 - 35.48, (±2.0%), length = 1.393\n============================================================================\nGeometric mean = 30.10 microns\nConfidence interval at 95.0 %\nCLT method: 29.47 - 30.75 (-2.1%, +2.2%), length = 1.283\n============================================================================\nMedian = 31.53 microns\nConfidence interval at 95.0 %\nrobust method: 30.84 - 32.81 (-2.2%, +4.1%), length = 1.970\n============================================================================\nMode (KDE-based) = 24.31 microns\nMaximum precision set to 0.1\nKDE bandwidth = 4.01 (silverman rule)\n \n============================================================================\nDISTRIBUTION FEATURES\n============================================================================\nSample size (n) = 2661\nStandard deviation = 18.32 (1-sigma)\nInterquartile range (IQR) = 23.98\nLognormal shape (Multiplicative Standard Deviation) = 1.75\n============================================================================\nShapiro-Wilk test warnings:\nData is not normally distributed!\nNormality test: 0.94, 0.00 (test statistic, p-value)\nData is not lognormally distributed!\nLognormality test: 0.99, 0.03 (test statistic, p-value)\n============================================================================\n"
     ]
    }
   ],
   "source": [
    "summarize(dataset['diameters'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Leaving aside the different numbers, there are some subtle differences compared to the results obtained with the toy dataset. First, the confidence interval method for the arithmetic mean is no longer the modified Cox (mCox) but the one based on the central limit theorem (CLT) advised by the [ASTM](https://en.wikipedia.org/wiki/ASTM_International). As previously noted, the function ```summarize()``` automatically choose the optimal confidence interval method depending on distribution features. We show below the decision tree flowchart for choosing the optimal confidence interval estimation method, which is based on [Lopez-Sanchez (2020)](https://doi.org/10.1016/j.jsg.2020.104042)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](https://github.com/marcoalopez/GrainSizeTools/blob/master/FIGURES/avg_map.png?raw=true)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The reason why the CLT method applies in this case is that the grain size distribution is not sufficiently close to a logarithmic distribution (note the Shapiro-Wilk test warning with a p-value < 0.05), and this might cause an inaccurate estimate of the arithmetic mean confidence interval.\n",
    "\n",
    "Now, let's focus on the different options of the ``summarize()`` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "text": [
      "\u001b[1;31mSignature:\u001b[0m\n",
      "\u001b[0msummarize\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m\n",
      "\u001b[0m    \u001b[0mdata\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\n",
      "\u001b[0m    \u001b[0mavg\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'amean'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'gmean'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'median'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'mode'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\n",
      "\u001b[0m    \u001b[0mci_level\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;36m0.95\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\n",
      "\u001b[0m    \u001b[0mbandwidth\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m'silverman'\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\n",
      "\u001b[0m    \u001b[0mprecision\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;36m0.1\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\n",
      "\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;31mDocstring:\u001b[0m\n",
      "Estimate different grain size statistics. This includes different means,\n",
      "the median, the frequency peak grain size via KDE, the confidence intervals\n",
      "using different methods, and the distribution features.\n",
      "\n",
      "Parameters\n",
      "----------\n",
      "data : array_like\n",
      "    the size of the grains\n",
      "\n",
      "avg : string, tuple or list; optional\n",
      "    the averages to be estimated\n",
      "\n",
      "    | Types:\n",
      "    | 'amean' - arithmetic mean\n",
      "    | 'gmean' - geometric mean\n",
      "    | 'median' - median\n",
      "    | 'mode' - the kernel-based frequency peak of the distribution\n",
      "\n",
      "ci_level : scalar between 0 and 1; optional\n",
      "    the certainty of the confidence interval (default = 0.95)\n",
      "\n",
      "bandwidth : string {'silverman' or 'scott'} or positive scalar; optional\n",
      "    the method to estimate the bandwidth or a scalar directly defining the\n",
      "    bandwidth. It uses the Silverman plug-in method by default.\n",
      "\n",
      "precision : positive scalar or None; optional\n",
      "    the maximum precision expected for the \"peak\" kde-based estimator.\n",
      "    Default is 0.1. Note that this is not related with the confidence\n",
      "    intervals\n",
      "\n",
      "Call functions\n",
      "--------------\n",
      "- amean, gmean, median, and freq_peak (from averages)\n",
      "\n",
      "Examples\n",
      "--------\n",
      ">>> summarize(dataset['diameters'])\n",
      ">>> summarize(dataset['diameters'], ci_level=0.99)\n",
      ">>> summarize(np.log(dataset['diameters']), avg=('amean', 'median', 'mode'))\n",
      "\n",
      "Returns\n",
      "-------\n",
      "None\n",
      "\u001b[1;31mFile:\u001b[0m      c:\\users\\marco\\documents\\github\\grainsizetools\\grain_size_tools\\grainsizetools_script.py\n",
      "\u001b[1;31mType:\u001b[0m      function\n"
     ],
     "name": "stdout"
    }
   ],
   "source": [
    "?summarize"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **TODO:**\n",
    "- explain the different options of ``summarize()`` through examples\n",
    "- examples using log-transformed populations"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.9-final"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}